Abstract

As a testament to their success, the theory 
of random forests has long been outpaced by 
their application in practice. In this paper, 
we take a step towards narrowing this gap 
by providing a consistency result for online 
random forests. 



1. Introduction 

Random forests are a class of ensemble method whose 
base learners are a collection of randomized tree 
predictors, which are combined through averaging. 
The original random forests framework described in
Breiman (2001) has been extremely influential (Svetnik
et al., 2003; Prasad et al., 2006; Cutler et al., 2007;
Shotton et al., 2011; Criminisi et al., 2011).

Despite their extensive use in practical settings, very 
little is known about the mathematical properties of 
these algorithms. A recent paper by one of the leading 
theoretical experts states that 

Despite growing interest and practical use, 
there has been little exploration of the sta- 
tistical properties of random forests, and lit- 
tle is known about the mathematical forces 
driving the algorithm (Biau, 2012). 

Theoretical work in this area typically focuses on styl- 
ized versions of the random forests algorithms used in 
practice. For example, Biau et al. (2008) prove the 
consistency of a variety of ensemble methods built by 
averaging base classifiers. Two of the models they 
study are direct simplifications of the forest growing 
algorithms used in practice; the others are stylized
neighbourhood averaging rules, which can be viewed
as simplifications of random forests through the lens
of Lin & Jeon (2002).

In this paper we make further steps towards narrowing 
the gap between theory and practice. In particular, we 
present what is to the best of our knowledge the first 
consistency result for online random forests. 

We show that the theory provides guidance for de- 
signing online random forest algorithms. A few simple 
experiments with our algorithm confirm the require- 
ments for consistency predicted by the theory. The 
experiments also highlight some theoretical and prac- 
tical problems that need to be addressed. 

2. Related Work 

Different variants of random forests are distinguished 
by the methods they use for growing the trees. The 
model described in Breiman (2001) builds each tree 
on a bootstrapped sample of the training set using the 
CART methodology (Breiman et al., 1984). The opti- 
mization in each leaf that searches for the optimal split 
point is restricted to a random selection of features, or 
linear combinations of features. 

The framework of Criminisi et al. (2011) operates 
slightly differently. Instead of choosing only features at 
random, this framework chooses entire decisions (i.e. 
both a feature or combination of features and a thresh- 
old together) at random and optimizes only over this 
set. They also offer a variety of different objectives 
which can be optimized to split each leaf, depending 
on the task at hand (e.g. classification vs manifold 
learning). Unlike the work of Breiman (2001), this 
framework chooses not to include bagging, preferring 
instead to train each tree on the entire data set and in- 
troduce randomness only in the splitting process. The 
authors argue that without bagging their model obtains
max-margin properties.

In addition to the frameworks mentioned above, many 
practitioners introduce their own variations on the ba- 
sic random forests algorithm, tailored to their specific 
problem domain. A variant from Bosch et al. (2007) 
is especially similar to the technique we use in this pa- 
per: When growing a tree the authors randomly select 
one third of the training data to determine the struc- 
ture of the tree and use the remaining two thirds to 
fit the leaf estimators. However, the authors consider 
this only as a technique for introducing randomness 
into the trees, whereas in our model the partitioning 
of data plays a central role in consistency. 

In addition to these offline methods, several re- 
searchers have focused on building online versions of 
random forests. Online models are attractive because 
they do not require that the entire training set be ac- 
cessible at once. These models are appropriate for 
streaming settings where training data is generated 
over time and should be incorporated into the model 
as quickly as possible. Several variants of online decision
tree models are present in the MOA system of
Bifet et al. (2010).

The primary difficulty with building online decision 
trees is their recursive nature. Data encountered once 
a split has been made cannot be used to correct earlier 
decisions. A notable approach to this problem is the
Hoeffding tree algorithm (Domingos & Hulten, 2000),
which works by maintaining several candidate splits in
each leaf. The quality of each split is estimated online
as data arrive in the leaf, but since the entire training
set is not available these quality measures are only
estimates. The Hoeffding bound is employed in each leaf
to control the amount of data which must be collected
to ensure that the split chosen on the basis of these
estimates is the true best split with high probability.
Domingos & Hulten (2000) prove that under reasonable
assumptions the online Hoeffding tree converges
to the offline tree with high probability. The Hoeffding
tree algorithm is implemented in the system of Bifet
et al. (2010).
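
As a rough illustration of how the Hoeffding bound controls the amount of data collected in a leaf, the sketch below uses the standard form of the bound, $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, for a quantity with range $R$ averaged over $n$ observations. The function names, and the rule of comparing the gap between the two best candidates against $\epsilon$, are illustrative assumptions rather than code from any of the cited systems.

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Half-width of the Hoeffding confidence interval after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def can_commit_to_best_split(best_gain, second_gain, value_range, delta, n):
    """True when the observed gap between the two leading candidates
    exceeds epsilon (hypothetical decision rule for illustration)."""
    return (best_gain - second_gain) > hoeffding_epsilon(value_range, delta, n)

def samples_needed(gap, value_range, delta):
    """Smallest n for which epsilon drops strictly below the given gap."""
    bound = value_range ** 2 * math.log(1.0 / delta) / (2.0 * gap ** 2)
    return math.floor(bound) + 1
```

Under these assumptions, a smaller gap between the two leading candidates requires quadratically more data before the leaf can commit to a split.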

Alternative methods for controlling tree growth in an 
online setting have also been explored. Saffari et al.
(2009) use the online bagging technique of Oza & Russel
(2001) and control leaf splitting in their online random
forest using two parameters. One parameter
specifies the minimum number of data points which
must be seen in a leaf before it can be split, and another
specifies a minimum quality threshold that the
best split in a leaf must reach. This is similar in flavor
to the technique used by Hoeffding trees, but trades
theoretical guarantees for more interpretable parameters.

One active avenue of research in online random forests 
involves tracking non-stationary distributions, also 
known as concept drift. Many of the online techniques 
incorporate features designed for this problem (Gama
et al., 2005; Abdulsalam, 2008; Saffari et al., 2009;
Bifet et al., 2009; 2012). However, tracking of
non-stationarity is beyond the scope of this paper.

The most well known theoretical result for random 
forests is that of Breiman (2001), which gives an up- 
per bound on the generalization error of the forest in 
terms of the correlation and strength of trees. Following
Breiman (2001), an interpretation of random
forests as an adaptive neighborhood weighting scheme
was published in Lin & Jeon (2002). This was followed
by the first consistency result in this area from
Breiman (2004), which proves consistency of a simpli- 
fied model of the random forests used in practice. In 
the context of quantile regression the consistency of 
a certain model of random forests has been shown by 
Meinshausen (2006). A model of random forests for 
survival analysis was shown to be consistent in Ish- 
waran & Kogalur (2010). 

Significant recent work in this direction comes from
Biau et al. (2008), who prove the consistency of a variety
of ensemble methods built by averaging base classifiers,
as is done in random forests. A key feature
of the consistency of the tree construction algorithms 
they present is a proposition that states that if the 
base classifier is consistent then the forest, which takes 
a majority vote of these classifiers, is itself consistent. 

The most recent theoretical study, and the one which 
achieves the closest match between theory and prac- 
tice, is that of Biau (2012). The most significant way 
in which their model differs from practice is that it 
requires a second data set which is not used to fit the 
leaf predictors in order to make decisions about vari- 
able importance when growing the trees. One of the 
innovations of the model we present in this paper is a 
way to circumvent this limitation in an online setting 
while maintaining consistency. 

3. Random Forests 

In this section we briefly review the random forests 
framework. For a more comprehensive review we re- 
fer the reader to Breiman (2001) and Criminisi et al. 
(2011). 

Random forests are built by combining the predictions 
of several trees, each of which is trained in isolation. 
Unlike in boosting (Schapire & Freund, 2012), where
the base classifiers are trained and combined using a
sophisticated weighting scheme, typically the trees are 
trained independently and the predictions of the trees 
are combined through a simple majority vote. 

There are three main choices to be made when con- 
structing a random tree. These are (1) the method for 
splitting the leafs, (2) the type of predictor to use in 
each leaf, and (3) the method for injecting randomness 
into the trees. 

Specifying a method for splitting leafs requires select- 
ing the shapes of candidate splits as well as a method 
for evaluating the quality of each candidate. Typical 
choices here are to use axis aligned splits, where data 
are routed to sub-trees depending on whether or not 
they exceed a threshold value in a chosen dimension; or 
linear splits, where a linear combination of features is
thresholded to make a decision. The threshold value
in either case can be chosen randomly or by optimizing
a function of the data in the leafs.

Figure 1. Three potential splits for a leaf node and the class
histograms for the children each split would create. The 
rightmost split creates the purest children and will have 
the greatest information gain. 

In order to split a leaf, a collection of candidate splits 
are generated and a criterion is evaluated to choose 
between them. A simple strategy is to choose among 
the candidates uniformly at random, as in the mod- 
els analyzed in Biau et al. (2008). A more common 
approach is to choose the candidate split which opti- 
mizes a purity function over the leafs that would be 
created. Typical choices here are to maximize the in- 
formation gain, or the Gini gain (Hastie et al., 2013). 
This situation is illustrated in Figure 1. 
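
To make the purity-based choice concrete, the following sketch scores a few candidate axis-aligned splits using the Gini gain. The helper names and the toy data are hypothetical and are not taken from any of the cited implementations.

```python
from collections import Counter

def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(points, labels, dim, threshold):
    """Impurity reduction when points with x[dim] > threshold go right."""
    left = [y for x, y in zip(points, labels) if x[dim] <= threshold]
    right = [y for x, y in zip(points, labels) if x[dim] > threshold]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_axis_aligned_split(points, labels, candidates):
    """Return the (dim, threshold) candidate with the largest Gini gain."""
    return max(candidates, key=lambda c: gini_gain(points, labels, c[0], c[1]))

# Toy leaf: class 0 lies on the left of dimension 0, class 1 on the right.
points = [(0.1, 0.9), (0.2, 0.1), (0.8, 0.8), (0.9, 0.2)]
labels = [0, 0, 1, 1]
candidates = [(0, 0.5), (1, 0.5), (0, 0.15)]
best = best_axis_aligned_split(points, labels, candidates)
```

In this toy leaf the split on dimension 0 at 0.5 produces two pure children, so it is the one selected.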

The most common choice for predictors in each leaf 
is to use the majority vote over the training points 
which fall in that leaf. Criminisi et al. (2011) explore 
the use of several different leaf predictors for regression 
and manifold learning, but these generalizations are 
beyond the scope of this paper. We consider majority 
vote classifiers in our model. 

Injecting randomness into the tree construction can
happen in many ways. The choice of which dimensions
to use as split candidates at each leaf can be random- 
ized, as well as the choice of coefficients for random 
combinations of features. In either case, thresholds 
can be chosen either randomly or by optimization over 
some or all of the data in the leaf. 

Another common method for introducing randomness 
is to build each tree using a bootstrapped or sub- 
sampled data set. In this way, each tree in the forest 
is trained on slightly different data, which introduces 
differences between the trees. 

4. Online Random Forests with Stream 
Partitioning 

In this section we describe the workings of our online 
random forest algorithm. A more precise (pseudo- 
code) description of the training procedure can be 
found in Appendix A. 

4.1. Forest Construction 

The random forest classifier is constructed by building 
a collection of random tree classifiers in parallel. Each 
tree is built independently and in isolation from the 
other trees in the forest. Unlike many other random 
forest algorithms we do not perform bootstrapping or
subsampling at this level; however, the individual trees 
each have their own optional mechanism for subsam- 
pling the data they receive. 

4.2. Tree Construction 

Each node of the tree is associated with a rectangular
subset of $\mathbb{R}^D$, and at each step of the construction
the collection of cells associated with the leafs of the
tree forms a partition of $\mathbb{R}^D$. The root of the tree
is $\mathbb{R}^D$ itself. At each step we receive a data point
$(X_t, Y_t)$ from the environment. Each point is assigned
to one of two possible streams at random with fixed
probability. We denote stream membership with the
variable $I_t \in \{s, e\}$. How the tree is updated at each
time step depends on which stream the corresponding
data point is assigned to.

We refer to the two streams as the structure stream 
and the estimation stream; points assigned to these 
streams are structure and estimation points, respec- 
tively. These names reflect the different uses of the 
two streams in the construction of the tree: 

Structure points are allowed to influence the struc- 
ture of the tree partition, i.e. the locations of candidate 
split points and the statistics used to choose between 
candidates, but they are not permitted to influence the
predictions that are made in each leaf of the tree.

Estimation points are not permitted to influence the 
shape of the tree partition, but can be used to estimate 
class membership probabilities in whichever leaf they 
are assigned to. 

Only two streams are needed to build a consistent for- 
est, but there is no reason we cannot have more. For 
instance, we explored the use of a third stream for 
points that the tree should ignore completely, which 
gives a form of online sub-sampling in each tree. We 
found empirically that including this third stream 
hurts performance of the algorithm, but its presence 
or absence does not affect the theoretical properties. 
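
The stream assignment described in this section can be sketched as follows. This is a minimal illustration, assuming a structure-stream probability of 0.5 and a toy `StreamCounter` class (both hypothetical) that only counts how many points each stream receives.

```python
import random

def assign_stream(rng, p_structure=0.5):
    """Draw I_t in {'s', 'e'} with a fixed probability of 's'."""
    return "s" if rng.random() < p_structure else "e"

class StreamCounter:
    """Toy stand-in for a tree; it only counts what each stream receives."""

    def __init__(self, seed, p_structure=0.5):
        self.rng = random.Random(seed)
        self.p_structure = p_structure
        self.counts = {"s": 0, "e": 0}

    def update(self, x, y):
        # A real tree would update structural statistics for 's' points
        # and leaf class histograms for 'e' points.
        stream = assign_stream(self.rng, self.p_structure)
        self.counts[stream] += 1
        return stream

trees = [StreamCounter(seed) for seed in range(3)]
for t in range(1000):
    x, y = (t / 1000.0,), t % 2
    for tree in trees:  # each tree partitions the stream independently
        tree.update(x, y)
```

Because every tree draws its own assignments, a point that is a structure point in one tree may be an estimation point in another.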

4.3. Leaf Splitting Mechanism 

When a leaf is created the number of candidate
split dimensions for the new leaf is set to
$\min(1 + \mathrm{Poisson}(\lambda), D)$, and this many distinct candidate
dimensions are selected uniformly at random. We then
collect $m$ candidate splits in each candidate dimension
($m$ is a parameter of the algorithm) by projecting
the first $m$ structure points to arrive in the newly created
leaf onto the candidate dimensions. We maintain
several statistics for each candidate split.
Specifically, for each candidate split we maintain class
histograms for each of the new leafs it would create, using
data from the estimation stream. We also maintain
structural statistics, computed from data in the structure
stream, which can be used to choose between the
candidate splits. The specific form of the structural
statistics does not affect the consistency of our model,
but it is crucial that they depend only on data in the
structure stream.
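
A minimal sketch of how candidate dimensions and candidate split points might be generated for a new leaf is given below. The Poisson sampler uses Knuth's multiplication method so that the example needs only the standard library; the function names and the toy data are assumptions made for illustration.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def choose_candidate_dimensions(rng, num_features, lam):
    """Pick min(1 + Poisson(lam), D) distinct dimensions uniformly at random."""
    count = min(1 + sample_poisson(rng, lam), num_features)
    return rng.sample(range(num_features), count)

def candidate_splits(first_structure_points, dims):
    """Project the first m structure points onto each candidate dimension."""
    return {d: [x[d] for x in first_structure_points] for d in dims}

rng = random.Random(0)
dims = choose_candidate_dimensions(rng, num_features=5, lam=1.0)
# The first m = 2 structure points to arrive in the new leaf (toy values).
structure_points = [(0.1, 0.4, 0.3, 0.9, 0.5), (0.7, 0.2, 0.8, 0.6, 0.0)]
splits = candidate_splits(structure_points, dims)
```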

Finally, we require two additional conditions which 
control when a leaf at depth d is split: 

1. Before a candidate split can be chosen, the class
histograms in each of the leafs it would create
must incorporate information from at least $\alpha(d)$
estimation points.

2. If any leaf receives more than $\beta(d)$ estimation
points, and the previous condition is satisfied for
any candidate split in that leaf, then when the
next structure point arrives in this leaf it must
be split regardless of the state of the structural
statistics.

The first condition ensures that leafs are not split
too often, and the second condition ensures that no
branch of the tree ever stops growing completely. The
only requirements on the functions $\alpha(d)$ and $\beta(d)$ are
that they must both grow unboundedly in $d$ and that
$\beta(d) > \alpha(d)$.

When a structure point arrives in a leaf, if the first 
condition is satisfied for some candidate split then the 
leaf may optionally be split at the corresponding point. 
The decision of whether to split the leaf or wait to 
collect more data is made on the basis of the structural 
statistics collected for the candidate splits in that leaf. 
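
The two splitting conditions and the optional split can be sketched as a single decision function. The choices $\alpha(d) = 2^d$ and $\beta(d) = 4 \cdot 2^d$, the tuple layout of `candidates`, and the rule of taking the eligible candidate with the largest gain when a split is forced are all assumptions made for this illustration.

```python
def alpha(depth):
    """Minimum estimation points per candidate child (hypothetical choice)."""
    return 2 ** depth

def beta(depth):
    """Estimation-point count that forces a split (hypothetical choice)."""
    return 4 * 2 ** depth

def split_decision(depth, leaf_estimation_count, candidates, tau):
    """Decide what to do when a structure point arrives in a leaf.

    candidates: list of (left_est_count, right_est_count, info_gain) tuples.
    Returns the index of the candidate to split on, or None to keep waiting.
    """
    # Condition 1: both children must hold at least alpha(d) estimation points.
    eligible = [i for i, (n_left, n_right, _) in enumerate(candidates)
                if min(n_left, n_right) >= alpha(depth)]
    if not eligible:
        return None
    best = max(eligible, key=lambda i: candidates[i][2])
    # Condition 2: past beta(d) estimation points the split is mandatory.
    if leaf_estimation_count > beta(depth):
        return best
    # Otherwise the split is optional and depends on the structural statistics.
    return best if candidates[best][2] >= tau else None
```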

4.4. Structural Statistics 

In each candidate child we maintain an estimate of the 
posterior probability of each class, as well as the total 
number of points we have seen fall in the candidate 
child, both counted from the structure stream. In order
to decide if a leaf should be split, we compute the
information gain for each candidate split which satisfies
condition 1 from the previous section,

$$I(S) = H(A) - \frac{|A'|}{|A|} H(A') - \frac{|A''|}{|A|} H(A'').$$

Here $S$ is the candidate split, $A$ is the cell belonging
to the leaf to be split, and $A'$ and $A''$ are the two
leafs that would be created if $A$ were split at $S$. The
function $H(A)$ is the discrete entropy, computed over
the labels of the structure points which fall in the cell
$A$, and $|A|$ denotes the number of structure points
which fall in $A$.

We select the candidate split with the largest information
gain for splitting, provided this split achieves a
minimum threshold in information gain, $\tau$. The value
of $\tau$ is a parameter of our algorithm.
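
The selection rule above can be sketched in a few lines. The example assumes that every candidate passed in already satisfies condition 1, measures entropy in bits (the base of the logarithm is not specified here), and uses hypothetical names and toy labels.

```python
import math
from collections import Counter

def entropy(labels):
    """Discrete entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """I(S) = H(A) - |A'|/|A| H(A') - |A''|/|A| H(A'')."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def select_split(candidates, parent_labels, tau):
    """Return the candidate with the largest gain if it reaches tau, else None."""
    gains = {name: information_gain(parent_labels, left, right)
             for name, (left, right) in candidates.items()}
    best = max(gains, key=gains.get)
    return best if gains[best] >= tau else None

# Structure-stream labels in the leaf and in each pair of candidate children.
parent = [0, 0, 1, 1]
candidates = {"S1": ([0, 0], [1, 1]), "S2": ([0, 1], [0, 1])}
chosen = select_split(candidates, parent, tau=0.1)
```

Here `S1` separates the two classes perfectly and is selected, while `S2` leaves both children as mixed as the parent and would be rejected by any positive threshold.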

4.5. Prediction 

At any time the online forest can be used to make 
predictions for unlabelled data points using the model 
built from the labelled data it has seen so far. To make 
a prediction for a query point $x$ at time $t$, each tree
computes, for each class $k$,

$$\eta_t^k(x) = \frac{1}{N^e(A_t(x))} \sum_{\substack{(X_\tau, Y_\tau) \in A_t(x) \\ I_\tau = e}} \mathbb{I}\{Y_\tau = k\},$$

where $A_t(x)$ denotes the leaf of the tree containing $x$
at time $t$, and $N^e(A_t(x))$ is the number of estimation
points which have been counted in $A_t(x)$ during its
lifetime. Similarly, the sum is over the labels of these
points. The tree prediction is then the class which
maximizes this value:

$$g_t(x) = \arg\max_k \{\eta_t^k(x)\}.$$

The forest predicts the class which receives the most
votes from the individual trees.

Note that this requires that we maintain class histograms
from both the structure and estimation
streams separately for each candidate child in the 
fringe of the tree. The counts from the structure 
stream are used to select between candidate split 
points, and the counts from the estimation stream are 
used to initialize the parameters in the newly created 
leafs after a split is made. 
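
The prediction rule can be sketched from the estimation-stream histograms alone. In this illustration each tree is represented only by the class histogram of the leaf containing the query point; the uniform posterior for an empty leaf and the tie-breaking rule in the forest vote are assumptions, and all names are hypothetical.

```python
from collections import Counter

def tree_posterior(estimation_counts):
    """eta_t^k(x) for each class k, from a leaf's estimation-stream histogram."""
    total = sum(estimation_counts)
    if total == 0:
        # Assumption: fall back to a uniform posterior in an empty leaf.
        return [1.0 / len(estimation_counts)] * len(estimation_counts)
    return [count / total for count in estimation_counts]

def tree_predict(estimation_counts):
    """argmax_k eta_t^k(x); list.index breaks ties toward the smaller index."""
    posterior = tree_posterior(estimation_counts)
    return posterior.index(max(posterior))

def forest_predict(leaf_histograms):
    """Majority vote over trees, given each tree's leaf histogram for x."""
    votes = Counter(tree_predict(h) for h in leaf_histograms)
    top = max(votes.values())
    # Assumption: ties between classes go to the smaller class index.
    return min(k for k, v in votes.items() if v == top)
```

For example, `forest_predict([[3, 5, 2], [1, 0, 0], [0, 2, 1]])` collects the tree votes 1, 0 and 1, so the forest predicts class 1.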

5. Theory 

In this section we state our main theoretical results and 
give an outline of the strategy for establishing consis- 
tency of our online random forest algorithm. In the 
interest of space and clarity we do not include proofs 
in this section. Unless otherwise noted, the proofs of 
all claims appear in Appendix B. 

We denote the tree classifier created by our online random
forest algorithm from $t$ data points as $g_t$. As $t$
varies we obtain a sequence of classifiers, and we are
interested in showing that the sequence $\{g_t\}$ is consistent,
or more precisely that the probability of error of
$g_t$ converges in probability to the Bayes risk, i.e.

$$L(g_t) = \mathbb{P}(g_t(X, Z) \neq Y \mid D_t) \to L^*,$$

as $t \to \infty$. Here $(X, Y)$ is a random test point and $Z$
denotes the randomness in the tree construction algorithm.
$D_t$ is the training set (of size $t$) and the probability
in the convergence is over the random selection
of $D_t$. The Bayes risk is the probability of error of
the Bayes classifier, which is the classifier that makes
predictions by choosing the class with the highest posterior
probability,

$$g(x) = \arg\max_k \mathbb{P}(Y = k \mid X = x),$$

(where ties are broken in favour of the smaller index).
The Bayes risk $L(g) = L^*$ is the minimum achievable
risk of any classifier for the distribution of $(X, Y)$. In
order to ease notation, we drop the explicit dependence
on $D_t$ in the remainder of this paper. More information
about this setting can be found in Devroye et al.
(1996).

Our main result is the following theorem: 

Theorem 1. Suppose the distribution of X has a den- 
sity with respect to the Lebesgue measure and that this 
density is bounded from above and below. Then the 
online random forest classifier described in this paper 
is consistent. 

The first step in proving Theorem 1 is to show that the 
consistency of a voting classifier, such as a random for- 
est, follows from the consistency of the base classifiers. 



We prove the following proposition, which is a straight- 
forward generalization of a proposition from Biau et al. 
(2008), who prove the same result for two class ensem- 
bles. 

Proposition 2. Assume that the sequence $\{g_t\}$ of randomized
classifiers is consistent for a certain distribution
of $(X, Y)$. Then the voting classifier $g_t^{(M)}$, obtained
by taking the majority vote over $M$ (not necessarily
independent) copies of $g_t$, is also consistent.

With Proposition 2 established, the remainder of the 
effort goes into proving the consistency of our tree con- 
struction. 

The first step is to separate the stream splitting ran- 
domness from the remaining randomness in the tree 
construction. We show that if a classifier is condition- 
ally consistent based on the outcome of some random 
variable, and the sampling process for this random 
variable generates acceptable values with probability 
1, then the resulting classifier is unconditionally con- 
sistent. 

Proposition 3. Suppose $\{g_t\}$ is a sequence of classifiers
whose probability of error converges conditionally
in probability to the Bayes risk $L^*$ for a specified
distribution on $(X, Y)$, i.e.

$$\mathbb{P}(g_t(X, Z, I) \neq Y \mid I) \to L^*$$

for all $I \in \mathcal{I}$, and that $\nu$ is a distribution on $I$. If
$\nu(\mathcal{I}) = 1$ then the probability of error converges
unconditionally in probability, i.e.

$$\mathbb{P}(g_t(X, Z, I) \neq Y) \to L^*.$$

In particular, $\{g_t\}$ is consistent for the specified distribution.

Proposition 3 allows us to condition on the random
variables $\{I_t\}_{t=1}^{\infty}$ which partition the data stream into
structure and estimation points in each tree. Provided
that the random partitioning process produces acceptable
sequences with probability 1, it is sufficient to
show that the random tree classifier is consistent conditioned
on such a sequence. In particular, in the remainder
of the argument we assume that $\{I_t\}_{t=1}^{\infty}$ is a
fixed, deterministic sequence which assigns infinitely
many points to each of the structure and estimation
streams. We refer to such a sequence as a partitioning
sequence.

The reason this is useful is that conditioning on a partitioning
sequence breaks the dependence between the
structure of the tree partition and the estimators in
the leafs. This is a powerful tool because it gives us
access to a class of consistency theorems which rely
on this type of independence. However, before we are
able to apply these theorems we must further reduce
our problem to proving the consistency of estimators
of the posterior distribution of each class.

Figure 2. The dependency structure of our algorithm. $S$
represents the randomness in the structure of the tree partition,
$E$ represents the randomness in the leaf estimators
and $I$ represents the randomness in the partitioning of the
data stream. $E$ and $S$ are independent conditioned on $I$.

Proposition 4. Suppose we have regression estimates,
$\eta_t^k(x)$, for each class posterior $\eta^k(x) =
\mathbb{P}(Y = k \mid X = x)$, and that these estimates are each
consistent. The classifier

$$g_t(x) = \arg\max_k \{\eta_t^k(x)\}$$

(where ties are broken in favour of the smaller index)
is consistent for the corresponding multiclass classification
problem.

Proposition 4 allows us to reduce the consistency of
the multiclass classifier to the problem of proving the
consistency of several two class posterior estimates.
Given a set of classes $\{1, \ldots, c\}$ we can re-assign the
labels using the map $(X, Y) \mapsto (X, \mathbb{I}\{Y = k\})$ for any
$k \in \{1, \ldots, c\}$ in order to get a two class problem where
$\mathbb{P}(Y = 1 \mid X = x)$ in this new problem is equal to $\eta^k(x)$
in the original multiclass problem. Thus to prove consistency
of the multiclass classifier it is enough to show
that each of these two class posteriors is consistent. To
this end we make use of the following theorem from Devroye
et al. (1996).

Theorem 5. Consider a partitioning classification
rule which builds a prediction $\eta_t(x)$ of $\eta(x) =
\mathbb{P}(Y = 1 \mid X = x)$ by averaging the labels in each cell
of the partition. If the labels of the voting points do
not influence the structure of the partition then

$$\mathbb{E}[|\eta_t(x) - \eta(x)|] \to 0$$

provided that

1. $\operatorname{diam}(A_t(X)) \to 0$ in probability,

2. $N^e(A_t(X)) \to \infty$ in probability.



Proof. See Theorem 6.1 in Devroye et al. (1996). □ 



Here $A_t(X)$ refers to the cell of the tree partition containing
a random test point $X$, and $\operatorname{diam}(A)$ indicates
the diameter of set $A$. The diameter is defined as the
maximum distance between any two points falling in
$A$,

$$\operatorname{diam}(A) = \sup_{x, y \in A} \|x - y\|.$$

The quantity $N^e(A_t(X))$ is the number of points contributing
to the estimation of the posterior at $X$.

This theorem places two requirements on the cells of 
the partition. The first condition ensures that the cells 
are sufficiently small that small details of the posterior 
distribution can be represented. The second condition 
requires that the cells be large enough that we are 
able to obtain high quality estimates of the posterior 
probability in each cell. 

The leaf splitting mechanism described in Section 4.3 
ensures that the second condition of Theorem 5 is sat- 
isfied. However, showing that our algorithm satisfies 
the first condition requires significantly more work. 
The chief difficulty lies in showing that every leaf of the 
tree will be split infinitely often in probability. Once 
this claim is established a relatively straightforward 
calculation shows that the expected size of each di- 
mension of a leaf is reduced each time it is split. These 
details are somewhat technical, so we refer the inter- 
ested reader to Appendix B for more information, as 
well as the proofs of the propositions stated in this 
section. 

6. Experiments 

In this section we demonstrate some empirical results 
on simple problems in order to illustrate the properties 
of our algorithm. We also provide a comparison to an 
existing online random forest algorithm. We plan to 
release code to reproduce all of the experiments in this 
section in the near future. 

6.1. Advantage of a Forest 

Our first experiment demonstrates that although the 
individual trees are consistent classifiers, empirically 
the performance of the forest is significantly better 
than each of the trees for problems with finite data. 
We demonstrate this on a synthetic five class mixture 
of Gaussians problem with significant class overlap and 
variation in prior weights. 

From Figure 3 it is clear that the forest converges much
more quickly than the individual trees. Result profiles
of this kind are common in the boosting and random
forests literature; however, in practice one often uses
inconsistent base classifiers in the ensemble (e.g. boosting
with decision stumps or random forests where the
trees are grown to full size). This experiment demonstrates
that although our base classifiers provably converge,
empirically there is still a benefit from averaging
in finite time.

Figure 3. Prediction accuracy of the forest and the trees it
averages on a simple mixture of Gaussians problem. The
horizontal line shows the accuracy of the Bayes classifier
on this problem. We see that the accuracy of the forest
consistently dominates the expected accuracy of the trees.
The forest in this example contains 100 trees. Error regions
show one standard deviation computed over 10 runs.

Figure 4. Excess error above the Bayes risk for a simple
synthetic problem. The solid line shows the excess error for
a forest where each tree is built to full depth. The dashed
line shows a forest where each tree requires $2^d$ examples in
a leaf at level $d$ in order to split. Both forests contain 100
trees.

6.2. Growing leaves 

Our next experiment demonstrates the importance of
the condition that $\alpha(d) \to \infty$, i.e. having the number
of data points in each leaf grow over time. We
demonstrate this using a synthetic two class distribution
specifically designed to exhibit problems when
$\alpha(d)$ does not grow.

In the distribution we construct, the distribution of $X$
is uniform on the unit square in $\mathbb{R}^2$, and the posterior
$\mathbb{P}(Y = 1 \mid X = x) = 0.5001$ for all $x$. Figure 4 shows
the excess error of two forests trained on several data 
sets of different sizes sampled from this distribution. 
In one of the forests the trees are grown to full depth, 
while in the other we force the size of the leafs to in- 
crease with their depth in the tree. 

As can be seen in Figure 4, building trees to full depth 
prevents the forest from making progress towards the 
Bayes error over a huge range of data set sizes, whereas 
the forest composed of trees with growing leafs steadily 
decreases its excess error. 
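
A simplified calculation helps explain why leaf size matters for this distribution. The sketch below is not a reproduction of Figure 4: it considers a single leaf that predicts by majority vote over $n$ estimation points (with $n$ odd to avoid ties), ignores the averaging over trees, and computes the exact excess error of that leaf when the posterior is 0.5001 everywhere. The function names are hypothetical.

```python
import math

def prob_majority_correct(n, p=0.5001):
    """P(majority of n labels equals the Bayes class), for odd n."""
    return sum(math.comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

def excess_error(n, p=0.5001):
    """Excess error above the Bayes risk of a leaf voting with n points.

    The leaf errs with probability 1 - p when it predicts the Bayes class
    and p otherwise, so the excess is (1 - q) * (2p - 1) with
    q = prob_majority_correct(n, p).
    """
    return (1.0 - prob_majority_correct(n, p)) * (2 * p - 1)

table = [(n, excess_error(n)) for n in (1, 11, 101, 1001)]
```

Under these assumptions the excess error shrinks only as the number of points in the leaf grows; a leaf holding a single point predicts the Bayes class with probability 0.5001, barely better than a coin flip.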

Admittedly, this scenario is quite artificial, and it can
be difficult to find real problems where the difference
is so pronounced. It is still an open question as to
whether a forest can be made consistent by averaging
over an infinite number of trees of full depth (although
see Breiman (2004) and Biau (2012) for results in this
direction). The purpose of this example is to show that
in the common scenario where the number of trees is 
a fixed parameter of the algorithm, having leafs that 
grow over time is important. 

6.3. Comparison to Offline 

In our third experiment, we demonstrate that our on- 
line algorithm is able to achieve similar performance 
to an offline implementation of random forests and 
also compare to an existing online random forests al- 
gorithm on a small non-synthetic problem. 

In particular, we demonstrate this on the USPS data 
set from the LibSVM repository (Chang & Lin, 2011). 
We have chosen the USPS data for this experiment 
because it allows us to compare our results directly to 
those of Saffari et al. (2009), whose algorithm is very
similar to our own. In the interest of comparability
we also use a forest of 100 trees and set the minimum
information gain threshold ($\tau$ in our model) to 0.1.
We show results from both online algorithms with 10 
passes through the data. 

Figure 5 shows that we are able to achieve performance 
very similar to the offline random forest on the full 
data. The performance we achieve is identical to the 
performance reported by Saffari et al. (2009) on this 
data set.

Figure 5. Comparison between offline random forests and
our online algorithm on the USPS data set. The online 
forest uses 10 passes through the data set. The third line 
is our implementation of the algorithm from Saffari et al. 
(2009); the performance shown here is identical to what 
they report. Error regions show one standard deviation 
computed over 10 runs. 



7. Discussion and Future Work 

In this paper we described an algorithm for building 
online random forests and showed that our algorithm 
is consistent. To the best of our knowledge this is the 
first consistency result for online random forests. 

The theory guides certain choices made when design- 
ing our algorithm, notably that it is necessary for the 
leafs in each tree to increase in size over time. Our 
experiments on simple problems confirm that this re- 
quirement is important. 

There are two major difficulties to overcome when 
building random forests online, both of which stem 
from the fact that decision trees are recursive structures.
When a data point is received in an online setting
it must be discarded before the next data point
arrives. This means that a data point received before 
a split is created cannot be used to update the esti- 
mators in leafs further down the tree, since it must be 
discarded before those leafs are created. Conversely, a 
point received late in the training process cannot be 
used to adjust the split points near the root of the tree, 
since the estimators in the adjacent subtrees depend 
on the current position of the split. 

The typical approach to building trees online, which 
is employed in Domingos & Hulten (2000) and Saf- 
fari et al. (2009), is to maintain a fringe of candidate 
children in each leaf of the tree. The algorithm col- 
lects statistics in each of these candidate children until 
some (algorithm-dependent) criterion is met, at which 
point a pair of candidate children is selected to replace 
their parent. The selected children become leafs in the 
new tree, acquiring their own candidate children, and 
the process repeats. Our algorithm also uses this ap- 
proach. 

The difficulty here is that the trees are grown breadth 
first, and maintaining the fringe of potential children 
becomes very memory intensive when the trees are 
large. Our algorithm also suffers from this deficiency, 
as maintaining the fringe requires O(cmd) statistics in 
each leaf, where d is the number of candidate split di- 
mensions, m is the number of candidate split points 
(i.e. md pairs of candidate children per leaf) and c 
is the number of classes in the problem. The num- 
ber of leafs grows exponentially fast with tree depth, 
meaning that for deep trees this memory cost becomes 
prohibitive. 
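
To make the growth concrete, the following is a rough back-of-the-envelope sketch in Python. It assumes one c-bin label histogram per candidate child (so 2md histograms per leaf) and a full binary tree of the given depth, which is a worst case; the function names are illustrative and not part of the algorithm described above. Keeping separate structure and estimation statistics would double the counts again.

```python
def fringe_statistics_per_leaf(num_classes: int, num_split_points: int, num_split_dims: int) -> int:
    """Counters held by one leaf, assuming one histogram per candidate child."""
    # m * d candidate splits, each inducing a left and a right candidate child.
    candidate_children = 2 * num_split_points * num_split_dims
    return num_classes * candidate_children


def fringe_statistics_per_tree(depth: int, num_classes: int, num_split_points: int, num_split_dims: int) -> int:
    """Counters held by all leaves of a full binary tree of the given depth."""
    num_leaves = 2 ** depth  # worst case: every node at this depth is a leaf
    return num_leaves * fringe_statistics_per_leaf(num_classes, num_split_points, num_split_dims)


if __name__ == "__main__":
    for depth in (5, 10, 15, 20):
        total = fringe_statistics_per_tree(depth, num_classes=10, num_split_points=10, num_split_dims=4)
        print(f"depth {depth:2d}: {total} counters per tree")
```

Each additional level doubles the total, so the per-leaf O(cmd) cost is multiplied by the number of leaves in every tree of the forest.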

Offline forests do not suffer from this problem, because 
they are able to grow the trees depth first. Since they 
do not need to accumulate statistics for more than 
one leaf at a time, the cost of computing even several 
megabytes of statistics per split is negligible. 

In practice this memory problem is resolved by grow- 
ing small trees, as in Saffari et al. (2009), or by bound- 
ing the number of nodes in the fringe of the tree, as in 
Domingos & Hulten (2000). Other models of stream- 
ing random forests, such as those discussed in Abdul- 
salam (2008), build trees in sequence instead of in par- 
allel, which reduces the total memory usage. 

In this paper, we do not address the memory re- 
quirement issue. We believe that memory efficient 
data structures with strong theoretical bounds, such as 
count-min sketch (Cormode & Muthukrishnan, 2005; 
Cormode, 2011) could be used to alleviate this prob- 
lem. Combining their theoretical bounds with ours 
makes for an interesting direction for future research. 
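
For readers unfamiliar with the data structure, here is a minimal count-min sketch in Python. It is only an illustration of the general technique, not something used in our algorithm: the class name, the choice of salted BLAKE2b hashing, and the string keys (for example a hypothetical "leaf|split|child|class" identifier) are all assumptions made for this sketch.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in depth * width cells; estimates never undercount."""

    def __init__(self, width: int, depth: int) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row: int, key: str) -> int:
        # One hash function per row, obtained by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a cell, so the minimum over rows is the tightest estimate.
        return min(self.table[row][self._index(row, key)] for row in range(self.depth))
```

The memory used is fixed by `width * depth` regardless of how many distinct keys are counted, at the price of possibly overestimating individual counts.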

Finally, our current algorithm is restricted to axis 
aligned splits. Many implementations of random 
forests use more elaborate split shapes, such as random 
linear or quadratic combinations of features. These 
strategies can be highly effective in practice, especially 
in sparse or high dimensional settings. Understanding 
how to maintain consistency in these settings is an- 
other potentially interesting direction of inquiry. 

A. Algorithm pseudo-code 

Candidate split dimension: A dimension along which a split may be made. Each leaf selects 
$\min(1 + \mathrm{Poisson}(\lambda), D)$ of these when it is created. 

Candidate split point: One of the first $m$ structure points to arrive in a leaf. 

Candidate split: A combination of a candidate split dimension and a position along that dimension to split. 
These are formed by projecting each candidate split point into each candidate split dimension. 

Candidate children: Each candidate split in a leaf induces two candidate children for that leaf. These are also 
referred to as the left and right child of that split. 

$N^e(A)$ is a count of estimation points in the cell $A$, and $Y^e(A)$ is the histogram of labels of these points in $A$. 

$N^s(A)$ is a count of structure points in the cell $A$, and $Y^s(A)$ is the histogram of labels of these points in $A$. 
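
As an illustration of how a leaf might draw its candidate split dimensions, the following Python sketch samples min(1 + Poisson(λ), D) distinct dimensions uniformly at random. The function names are hypothetical, and Knuth's multiplication method for Poisson sampling is simply one convenient standard-library-only choice; it is not prescribed by the algorithm.

```python
import math
import random
from typing import List


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def select_candidate_split_dimensions(num_dims: int, lam: float, rng: random.Random) -> List[int]:
    """Choose min(1 + Poisson(lam), D) distinct dimensions uniformly at random."""
    how_many = min(1 + sample_poisson(lam, rng), num_dims)
    return sorted(rng.sample(range(num_dims), how_many))


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        print(select_candidate_split_dimensions(num_dims=8, lam=1.0, rng=rng))
```

Because the Poisson draw is zero with probability exp(-λ), every leaf has a positive chance of receiving exactly one candidate dimension.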



Algorithm 1 BuildTree 

Require: Initially the tree has exactly one leaf (TreeRoot) which covers the whole space 
Require: The dimensionality of the input, D. Parameters λ, m and τ. 
SelectCandidateSplitDimensions(TreeRoot, min(1 + Poisson(λ), D)) 
for t = 1 ... do 
    Receive (X_t, Y_t, I_t) from the environment 
    A_t ← leaf containing X_t 
    if I_t = estimation then 
        UpdateEstimationStatistics(A_t, (X_t, Y_t)) 
        for all S ∈ CandidateSplits(A_t) do 
            for all A ∈ CandidateChildren(S) do 
                if X_t ∈ A then 
                    UpdateEstimationStatistics(A, (X_t, Y_t)) 
                end if 
            end for 
        end for 
    else if I_t = structure then 
        if A_t has fewer than m candidate split points then 
            for all d ∈ CandidateSplitDimensions(A_t) do 
                CreateCandidateSplit(A_t, d, π_d X_t) 
            end for 
        end if 
        for all S ∈ CandidateSplits(A_t) do 
            for all A ∈ CandidateChildren(S) do 
                if X_t ∈ A then 
                    UpdateStructuralStatistics(A, (X_t, Y_t)) 
                end if 
            end for 
        end for 
        if CanSplit(A_t) then 
            if ShouldSplit(A_t) then 
                Split(A_t) 
            else if MustSplit(A_t) then 
                Split(A_t) 
            end if 
        end if 
    end if 
end for 


Algorithm 2 Split 
Require: A leaf A 
    S ← BestSplit(A) 
    A' ← LeftChild(S) 
    SelectCandidateSplitDimensions(A', min(1 + Poisson(λ), D)) 
    A'' ← RightChild(S) 
    SelectCandidateSplitDimensions(A'', min(1 + Poisson(λ), D)) 
    return A', A'' 



Algorithm 3 CanSplit 
Require: A leaf A 
    d ← Depth(A) 
    for all S ∈ CandidateSplits(A) do 
        if SplitIsValid(A, S) then 
            return true 
        end if 
    end for 
    return false 



Algorithm 4 SplitIsValid 
Require: A leaf A 
Require: A split S 
    d ← Depth(A) 
    A' ← LeftChild(S) 
    A'' ← RightChild(S) 
    return N^e(A') ≥ α(d) and N^e(A'') ≥ α(d) 



Algorithm 5 MustSplit 
Require: A leaf A 
    d ← Depth(A) 
    return N^e(A) ≥ β(d) 



Algorithm 6 ShouldSplit 
Require: A leaf A 
    for all S ∈ CandidateSplits(A) do 
        if InformationGain(A, S) > τ then 
            if SplitIsValid(A, S) then 
                return true 
            end if 
        end if 
    end for 
    return false 
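
The split predicates in Algorithms 3 to 6, together with the way BuildTree combines them, can be sketched in Python as follows. This is a simplified illustration under several assumptions: the class and field names are invented for the example, the information gain of each candidate split is assumed to be precomputed from structure points, the threshold functions α and β are passed in as callables, and the comparisons against α(d) and β(d) are taken to be "at least".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CandidateSplit:
    left_estimation_count: int   # N^e(A') for the left candidate child
    right_estimation_count: int  # N^e(A'') for the right candidate child
    information_gain: float      # assumed precomputed from structure points


@dataclass
class Leaf:
    depth: int
    estimation_count: int        # N^e(A)
    candidate_splits: List[CandidateSplit] = field(default_factory=list)


def split_is_valid(leaf: Leaf, split: CandidateSplit, alpha: Callable[[int], float]) -> bool:
    needed = alpha(leaf.depth)
    return split.left_estimation_count >= needed and split.right_estimation_count >= needed


def can_split(leaf: Leaf, alpha: Callable[[int], float]) -> bool:
    return any(split_is_valid(leaf, s, alpha) for s in leaf.candidate_splits)


def must_split(leaf: Leaf, beta: Callable[[int], float]) -> bool:
    return leaf.estimation_count >= beta(leaf.depth)


def should_split(leaf: Leaf, alpha: Callable[[int], float], tau: float) -> bool:
    return any(
        s.information_gain > tau and split_is_valid(leaf, s, alpha)
        for s in leaf.candidate_splits
    )


def decide_to_split(leaf: Leaf, alpha, beta, tau: float) -> bool:
    """Mirror of the final block of BuildTree: a valid split must exist first."""
    if not can_split(leaf, alpha):
        return False
    return should_split(leaf, alpha, tau) or must_split(leaf, beta)
```

Note that `must_split` alone never triggers a split: without at least one valid candidate, the leaf keeps collecting points however many it has seen.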



Algorithm 7 BestSplit 
Require: A leaf A 
Require: At least one valid candidate split exists for A 
    best_split ← none 
    for all S ∈ CandidateSplits(A) do 
        if InformationGain(A, S) > InformationGain(A, best_split) then 
            if SplitIsValid(A, S) then 
                best_split ← S 
            end if 
        end if 
    end for 
    return best_split 



Algorithm 8 InformationGain 
Require: A leaf A 
Require: A split S 
    A' ← LeftChild(S) 
    A'' ← RightChild(S) 
    return Entropy(Y^s(A)) − (N^s(A') / N^s(A)) Entropy(Y^s(A')) − (N^s(A'') / N^s(A)) Entropy(Y^s(A'')) 
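
A small Python sketch of this computation on label histograms is given below. It assumes the usual weighting of each child's entropy by its share of the parent's structure points and uses base-2 logarithms; both are assumptions of the sketch, and the function names are illustrative.

```python
import math
from typing import Sequence


def entropy(histogram: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the label distribution given by a histogram of counts."""
    total = sum(histogram)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in histogram if c > 0)


def information_gain(parent: Sequence[int], left: Sequence[int], right: Sequence[int]) -> float:
    """Entropy of the parent minus the count-weighted entropies of the two children."""
    n_parent = sum(parent)
    if n_parent == 0:
        return 0.0
    return (
        entropy(parent)
        - (sum(left) / n_parent) * entropy(left)
        - (sum(right) / n_parent) * entropy(right)
    )


if __name__ == "__main__":
    # A split that separates two balanced classes perfectly, and one that does nothing.
    print(information_gain([5, 5], [5, 0], [0, 5]))
    print(information_gain([4, 4], [2, 2], [2, 2]))
```

The numerical value of the gain depends on the logarithm base, so a threshold τ is only meaningful relative to the base used.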



Algorithm 9 UpdateEstimationStatistics 
Require: A leaf A 
Require: A point (X, Y) 
    N^e(A) ← N^e(A) + 1 
    Y^e(A) ← Y^e(A) + Y 



Algorithm 10 UpdateStructuralStatistics 
Require: A leaf A 
Require: A point (X, Y) 
    N^s(A) ← N^s(A) + 1 
    Y^s(A) ← Y^s(A) + Y 


B. Proof of Consistency 
B.1. A note on notation 

$A$ will be reserved for subsets of $\mathbb{R}^D$, and unless otherwise indicated it can be assumed that $A$ denotes a cell 
of the tree partition. We will often be interested in the cell of the tree partition containing a particular point, 
which we denote $A(x)$. Since the partition changes over time, and therefore the shape of $A(x)$ changes as well, 
we use a subscript to disambiguate: $A_t(x)$ is the cell of the partition containing $x$ at time $t$. Cells in the tree 
partition have a lifetime which begins when they are created as a candidate child to an existing leaf and ends 
when they are themselves split into two children. When referring to a point $X_\tau \in A_t(x)$ it is understood that $\tau$ 
is restricted to the lifetime of $A_t(x)$. 

We treat cells of the tree partition and leafs of the tree defining it interchangeably, denoting both with an 
appropriately decorated A. 

$N$ generally refers to the number of points of some type in some interval of time. The various decorations the 
$N$ receives specify which particular type of point or interval of time is being considered. A superscript always 
denotes type, so $N^k$ refers to a count of points of type $k$. Two special types, $e$ and $s$, are used to denote 
estimation and structure points, respectively. Pairs of subscripts are used to denote time intervals, so $N^k_{a,b}$ 
denotes the number of points of type $k$ which appear during the time interval $[a, b]$. We also use $N$ as a function 
whose argument is a subset of $\mathbb{R}^D$ in order to restrict the counting spatially: $N^e_{a,b}(A)$ refers to the number of 
estimation points which fall in the set $A$ during the time interval $[a, b]$. We make use of one additional variant 
of $N$ as a function when its argument is a cell in the partition: when we write $N^k(A_t(x))$, without subscripts on 
$N$, the interval of time we count over is understood to be the lifetime of the cell $A_t(x)$. 

B.2. Preliminaries 

Lemma 6. Suppose we partition a stream of data into $c$ parts by assigning each point $(X_t, Y_t)$ to part 
$I_t \in \{1, \ldots, c\}$ with fixed probability $p_k$, meaning that 

\[ N^k_{a,b} = \sum_{t=a}^{b} \mathbb{I}\{I_t = k\} . \qquad (1) \]

Then with probability 1, $N^k_{a,b} \to \infty$ for all $k \in \{1, \ldots, c\}$ as $b - a \to \infty$. 

Proof. Note that $\mathbb{P}(I_t = 1) = p_1$ and these events are independent for each $t$. Since $p_1 > 0$ is fixed, 
the sum of these probabilities over $t$ diverges, so by the second Borel-Cantelli lemma the probability that the 
events in this sequence occur infinitely often is 1. The cases for $I_t \in \{2, \ldots, c\}$ are similar. □ 

Lemma 7. Let $X_t$ be a sequence of iid random variables with distribution $\mu$, let $A$ be a fixed set such that 
$\mu(A) > 0$ and let $\{I_t\}$ be a fixed partitioning sequence. Then the random variable 

\[ N^k_{a,b}(A) = \sum_{a \le t \le b \,:\, I_t = k} \mathbb{I}\{X_t \in A\} \]

is Binomial with parameters $N^k_{a,b}$ and $\mu(A)$. In particular, 

\[ \mathbb{P}\left( N^k_{a,b}(A) \le \frac{\mu(A)}{2} N^k_{a,b} \right) \le \exp\left( -\frac{\mu(A)^2}{2} N^k_{a,b} \right) \]

which goes to 0 as $b - a \to \infty$, where $N^k_{a,b}$ is the deterministic quantity defined as in Equation 1. 

Proof. $N^k_{a,b}(A)$ is a sum of iid indicator random variables so it is Binomial. It has the specified parameters 
because it is a sum over $N^k_{a,b}$ elements and $\mathbb{P}(X_t \in A) = \mu(A)$. Moreover, $\mathbb{E}[N^k_{a,b}(A)] = \mu(A) N^k_{a,b}$ so by 
Hoeffding's inequality we have that 

\[ \mathbb{P}\left( N^k_{a,b}(A) \le \mathbb{E}[N^k_{a,b}(A)] - \epsilon N^k_{a,b} \right) = \mathbb{P}\left( N^k_{a,b}(A) \le N^k_{a,b}(\mu(A) - \epsilon) \right) \le \exp\left( -2 \epsilon^2 N^k_{a,b} \right) . \]

Setting $\epsilon = \frac{1}{2}\mu(A)$ gives the desired result. □ 
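
As a sanity check, the tail bound of Lemma 7 can be compared against the exact Binomial lower tail. The Python sketch below does this for a few illustrative values of the count n (standing in for $N^k_{a,b}$) and of μ(A); the function names and the chosen values are assumptions made for the example.

```python
import math


def binomial_lower_tail(n: int, p: float, threshold: float) -> float:
    """Exact P(Binomial(n, p) <= threshold)."""
    upper = min(n, math.floor(threshold))
    return sum(math.comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(upper + 1))


def lemma7_bound(n: int, mu: float) -> float:
    """Hoeffding bound exp(-mu^2 n / 2) on P(N(A) <= mu n / 2)."""
    return math.exp(-0.5 * mu**2 * n)


if __name__ == "__main__":
    for n in (10, 50, 200, 1000):
        for mu in (0.1, 0.3, 0.5):
            exact = binomial_lower_tail(n, mu, mu * n / 2)
            print(f"n={n:4d} mu={mu:.1f} exact={exact:.3e} bound={lemma7_bound(n, mu):.3e}")
```

The bound is loose for small μ(A), which is why the number of points required in the later propositions scales like $1/\mu(A)^2$.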
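
B.3. Proof of Proposition 2 

Proof. Let $g(x)$ denote the Bayes classifier. Consistency of $\{g_t\}$ is equivalent to saying that $\mathbb{E}[L(g_t)] = 
\mathbb{P}(g_t(X, Z) \ne Y) \to L^*$. In fact, since $\mathbb{P}(g_t(X, Z) \ne Y \mid X = x) \ge \mathbb{P}(g(X) \ne Y \mid X = x)$ for all $x \in \mathbb{R}^D$, 
consistency of $\{g_t\}$ means that for $\mu$-almost all $x$, 

\[ \mathbb{P}(g_t(X, Z) \ne Y \mid X = x) \to \mathbb{P}(g(X) \ne Y \mid X = x) = 1 - \max_k \{\eta^k(x)\} . \]

Define the following two sets of indices 

\[ G = \{ k \mid \eta^k(x) = \max_k \{\eta^k(x)\} \} , \]
\[ B = \{ k \mid \eta^k(x) < \max_k \{\eta^k(x)\} \} . \]

Then 

\[
\begin{aligned}
\mathbb{P}(g_t^{(M)}(X, Z^M) \ne Y \mid X = x)
  &= \sum_k \mathbb{P}(g_t^{(M)}(X, Z^M) = k \mid X = x) \, \mathbb{P}(Y \ne k \mid X = x) \\
  &\le (1 - \max_k \{\eta^k(x)\}) \sum_{k \in G} \mathbb{P}(g_t^{(M)}(X, Z^M) = k \mid X = x)
     + \sum_{k \in B} \mathbb{P}(g_t^{(M)}(X, Z^M) = k \mid X = x) ,
\end{aligned}
\]

which means it suffices to show that $\mathbb{P}(g_t^{(M)}(X, Z^M) = k \mid X = x) \to 0$ for all $k \in B$. However, using $Z^M$ to 
denote $M$ (possibly dependent) copies of $Z$, for all $k \in B$, 

\[
\begin{aligned}
\mathbb{P}(g_t^{(M)}(x, Z^M) = k)
  &= \mathbb{P}\left( \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} > \max_{c \ne k} \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = c\} \right) \\
  &\le \mathbb{P}\left( \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} \ge 1 \right) .
\end{aligned}
\]

By Markov's inequality, 

\[
\begin{aligned}
  &\le \mathbb{E}\left[ \sum_{j=1}^{M} \mathbb{I}\{g_t(x, Z_j) = k\} \right] \\
  &= M \, \mathbb{P}(g_t(x, Z) = k) \to 0 .
\end{aligned}
\]

□ 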



B.4. Proof of Proposition 3 

Proof. The sequence in question is uniformly integrable, so it is sufficient to show that $\mathbb{E}[\mathbb{P}(g_t(X, Z, I) \ne Y \mid I)] \to L^*$ 
implies the result, where the expectation is taken over the random selection of training set. 

We can write 

\[
\begin{aligned}
\mathbb{P}(g_t(X, Z, I) \ne Y) &= \mathbb{E}[\mathbb{P}(g_t(X, Z, I) \ne Y \mid I)] \\
  &= \int_{\mathcal{I}} \mathbb{P}(g_t(X, Z, I) \ne Y \mid I) \, \nu(I) + \int_{\mathcal{I}^c} \mathbb{P}(g_t(X, Z, I) \ne Y \mid I) \, \nu(I) .
\end{aligned}
\]

By assumption $\nu(\mathcal{I}^c) = 0$, so we have 

\[ \lim_{t \to \infty} \mathbb{P}(g_t(X, Z, I) \ne Y) = \lim_{t \to \infty} \int_{\mathcal{I}} \mathbb{P}(g_t(X, Z, I) \ne Y \mid I) \, \nu(I) . \]

Since probabilities are bounded in the interval $[0, 1]$, the dominated convergence theorem allows us to exchange 
the integral and the limit, 

\[ = \int_{\mathcal{I}} \lim_{t \to \infty} \mathbb{P}(g_t(X, Z, I) \ne Y \mid I) \, \nu(I) \]

and by assumption the conditional risk converges to the Bayes risk for all $I \in \mathcal{I}$, so 

\[ = L^* \int_{\mathcal{I}} \nu(I) = L^* \]

which proves the claim. □ 

B.5. Proof of Proposition 4 

Proof. By definition, the rule 

\[ g(x) = \arg\max_k \{\eta^k(x)\} \]

(where ties are broken in favour of smaller $k$) achieves the Bayes risk. In the case where all the $\eta^k(x)$ are equal 
there is nothing to prove, since all choices have the same probability of error. Therefore, suppose there is at least 
one $k$ such that $\eta^k(x) < \eta^{g(x)}(x)$ and define 

\[ m(x) = \eta^{g(x)}(x) - \max_k \{ \eta^k(x) \mid \eta^k(x) < \eta^{g(x)}(x) \} \]
\[ m_t(x) = \eta_t^{g(x)}(x) - \max_k \{ \eta_t^k(x) \mid \eta^k(x) < \eta^{g(x)}(x) \} \]

The function $m(x) > 0$ is the margin function which measures how much better the best choice is than the second 
best choice, ignoring possible ties for best. The function $m_t(x)$ measures the margin of $g_t(x)$. If $m_t(x) > 0$ then 
$g_t(x)$ has the same probability of error as the Bayes classifier. 

The assumption above guarantees that there is some $\epsilon$ such that $m(x) > \epsilon$. Using $C$ to denote the number of 
classes, by making $t$ large we can satisfy 

\[ \mathbb{P}\left( |\eta_t^k(X) - \eta^k(X)| < \epsilon/2 \right) \ge 1 - \delta/C \]

since $\eta_t^k$ is consistent. Thus 

\[ \mathbb{P}\left( \bigcap_{k=1}^{C} |\eta_t^k(X) - \eta^k(X)| < \epsilon/2 \right) \ge 1 - C + \sum_{k=1}^{C} \mathbb{P}\left( |\eta_t^k(X) - \eta^k(X)| < \epsilon/2 \right) \ge 1 - \delta . \]

So with probability at least $1 - \delta$ we have 

\[
\begin{aligned}
m_t(X) &= \eta_t^{g(X)}(X) - \max_k \{ \eta_t^k(X) \mid \eta^k(X) < \eta^{g(X)}(X) \} \\
  &\ge (\eta^{g(X)}(X) - \epsilon/2) - \max_k \{ \eta^k(X) + \epsilon/2 \mid \eta^k(X) < \eta^{g(X)}(X) \} \\
  &= \eta^{g(X)}(X) - \max_k \{ \eta^k(X) \mid \eta^k(X) < \eta^{g(X)}(X) \} - \epsilon \\
  &= m(X) - \epsilon \\
  &> 0 .
\end{aligned}
\]

Since $\delta$ is arbitrary this means that the risk of $g_t$ converges in probability to the Bayes risk. □ 

Figure 6. This figure shows the setting of Proposition 8. Conditioned on a partially built tree we select an arbitrary leaf 
at depth d and an arbitrary candidate split in that leaf. The proposition shows that, assuming no other split for A is 
selected, we can guarantee that the chosen candidate split will occur in bounded time with arbitrarily high probability. 

B.6. Proof of Theorem 1 

The proof of Theorem 1 is built in several pieces. 

Proposition 8. Fix a partitioning sequence. Let $t_0$ be a time at which a split occurs in a tree built using this 
sequence, and let $g_{t_0}$ denote the tree after this split has been made. If $A$ is one of the newly created cells in 
$g_{t_0}$ then we can guarantee that the cell $A$ is split before time $t > t_0$ with probability at least $1 - \delta$ by making $t$ 
sufficiently large. 

Proof. Let $d$ denote the depth of $A$ in the tree $g_{t_0}$ and note that $\mu(A) > 0$ with probability 1 since $X$ has a 
density. This situation is illustrated in Figure 6. By construction, if the following conditions hold: 

1. For some candidate split in $A$, the number of estimation points in both children is at least $\alpha(d)$, 

2. The number of estimation points in $A$ is at least $\beta(d)$, 

then the algorithm must split A when the next structure point arrives. Thus in order to force a split we need 
the following sequence of events to occur: 

1. A structure point must arrive in A to create a candidate split point. 

2. The above two conditions must be satisfied. 

3. Another structure point must arrive in A to force a split. 

It is possible for a split to be made before these events occur, but assuming a split is not triggered by some other 
mechanism we can guarantee that this sequence of events will occur in bounded time with high probability. 

Suppose a split is not triggered by a different mechanism. Define $E_0$ to be an event that occurs at $t_0$ with 
probability 1, and let $E_1 \le E_2 \le E_3$ be the times at which the above numbered events occur. Each of these 
events requires the previous one to have occurred and moreover, the sequence has a Markov structure, so for 
$t_0 \le t_1 \le t_2 \le t_3 = t$ we have 

\[
\begin{aligned}
\mathbb{P}(E_1 \le t \cap E_2 \le t \cap E_3 \le t \mid E_0 = t_0)
  &\ge \mathbb{P}(E_1 \le t_1 \cap E_2 \le t_2 \cap E_3 \le t_3 \mid E_0 = t_0) \\
  &= \mathbb{P}(E_1 \le t_1 \mid E_0 = t_0) \, \mathbb{P}(E_2 \le t_2 \mid E_1 \le t_1) \, \mathbb{P}(E_3 \le t_3 \mid E_2 \le t_2) \\
  &\ge \mathbb{P}(E_1 \le t_1 \mid E_0 = t_0) \, \mathbb{P}(E_2 \le t_2 \mid E_1 = t_1) \, \mathbb{P}(E_3 \le t_3 \mid E_2 = t_2) .
\end{aligned}
\]

We can rewrite the first and last term in more friendly notation as 

\[ \mathbb{P}(E_1 \le t_1 \mid E_0 = t_0) = \mathbb{P}(N^s_{t_0,t_1}(A) \ge 1) , \]
\[ \mathbb{P}(E_3 \le t_3 \mid E_2 = t_2) = \mathbb{P}(N^s_{t_2,t_3}(A) \ge 1) . \]

Figure 7. This figure diagrams the structure of the argument used in Propositions 8 and 9. The events $E_0, E_1, E_2, E_3$ 
occur at or after $t_0$, and the indicated intervals $t_1 - t_0$, $t_2 - t_1$ and $t_3 - t_2$ show regions where the next event must 
occur with high probability. Each of these intervals is finite, so their sum is also finite. We find an interval which 
contains all of the events with high probability by summing the lengths of the intervals for which we have individual bounds. 

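Lemma 7 allows us to lower bound both of these probabilities by $1 - \epsilon$ for any $\epsilon > 0$ by making $t_1 - t_0$ and $t_3 - t_2$ 
large enough that 

\[ N^s_{t_0,t_1} \ge \max\left\{ \frac{2}{\mu(A)}, \; \frac{2}{\mu(A)^2} \log\left(\frac{1}{\epsilon}\right) \right\} \]

and 

\[ N^s_{t_2,t_3} \ge \max\left\{ \frac{2}{\mu(A)}, \; \frac{2}{\mu(A)^2} \log\left(\frac{1}{\epsilon}\right) \right\} \]

respectively. To bound the centre term, recall that $\mu(A') > 0$ and $\mu(A'') > 0$ with probability 1, and $\beta(d) \ge \alpha(d)$ 
so 

\[
\begin{aligned}
\mathbb{P}(E_2 \le t_2 \mid E_1 = t_1)
  &\ge \mathbb{P}\left( N^e_{t_1,t_2}(A') \ge \beta(d) \cap N^e_{t_1,t_2}(A'') \ge \beta(d) \right) \\
  &\ge \mathbb{P}\left( N^e_{t_1,t_2}(A') \ge \beta(d) \right) + \mathbb{P}\left( N^e_{t_1,t_2}(A'') \ge \beta(d) \right) - 1 ,
\end{aligned}
\]

and we can again use Lemma 7 to lower bound this by $1 - \epsilon$ by making $t_2 - t_1$ sufficiently large that 

\[ N^e_{t_1,t_2} \ge \max\left\{ \frac{2\beta(d)}{\min\{\mu(A'), \mu(A'')\}}, \; \frac{2}{\min\{\mu(A'), \mu(A'')\}^2} \log\left(\frac{2}{\epsilon}\right) \right\} . \]

Thus by setting $\epsilon = 1 - (1 - \delta)^{1/3}$ we can ensure that the probability of a split before time $t$ is at least $1 - \delta$ if we 
make 

\[ t = t_0 + (t_1 - t_0) + (t_2 - t_1) + (t_3 - t_2) \]

sufficiently large. □ 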
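
To give a feel for the sizes involved, the following Python sketch computes how many points Lemma 7 asks for in each stage. The thresholds are derived here directly from the lemma: since $\mathbb{P}(N(A) \le \mu(A) N / 2) \le \exp(-\mu(A)^2 N / 2)$, requiring $\mu(A) N / 2$ to reach the target count and the right-hand side to fall below the allowed failure probability yields the maxima used in the code. The function names and the example values of μ and β(d) are assumptions made for illustration.

```python
import math


def per_stage_epsilon(delta: float, stages: int = 3) -> float:
    """Failure probability allowed per stage so that all stages succeed w.p. >= 1 - delta."""
    return 1.0 - (1.0 - delta) ** (1.0 / stages)


def structure_points_needed(mu: float, eps: float) -> int:
    """Structure points N such that P(N(A) >= 1) >= 1 - eps, using Lemma 7."""
    return math.ceil(max(2.0 / mu, (2.0 / mu**2) * math.log(1.0 / eps)))


def estimation_points_needed(mu_min: float, beta_d: float, eps: float) -> int:
    """Estimation points N such that both children reach beta(d) w.p. >= 1 - eps.

    mu_min stands for min(mu(A'), mu(A'')); each child is allowed eps / 2 (union bound).
    """
    return math.ceil(max(2.0 * beta_d / mu_min, (2.0 / mu_min**2) * math.log(2.0 / eps)))


if __name__ == "__main__":
    eps = per_stage_epsilon(delta=0.05)
    print("per-stage epsilon:", eps)
    print("structure points per stage:", structure_points_needed(mu=0.2, eps=eps))
    print("estimation points:", estimation_points_needed(mu_min=0.05, beta_d=40, eps=eps))
```

These counts refer to points of the relevant type in the stream as a whole, so the corresponding time intervals are longer by a factor depending on how often each type is assigned.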

Proposition 9. Fix a partitioning sequence. Each cell in a tree built based on this sequence is split infinitely 
often in probability, i.e. for any $x$ in the support of $X$, 

\[ \mathbb{P}(A_t(x) \text{ has been split fewer than } K \text{ times}) \to 0 \]

as $t \to \infty$ for all $K$. 

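Proof. For an arbitrary point $x$ in the support of $X$, let $E_k$ denote the time at which the cell containing $x$ is split 
for the $k$th time, or infinity if the cell containing $x$ is split fewer than $k$ times (define $E_0 = 0$ with probability 
1). Now define the following sequence: 

\[ t_0 = 0 \]
\[ t_i = \min\{ t \mid \mathbb{P}(E_i \le t \mid E_{i-1} = t_{i-1}) \ge 1 - \epsilon \} \]

and set $T_\delta = t_k$. Proposition 8 guarantees that each of the above $t_i$'s exists and is finite. Compute, 

\[
\begin{aligned}
\mathbb{P}(E_k \le T_\delta) &= \mathbb{P}\left( \bigcap_{i=1}^{k} [E_i \le T_\delta] \right) \\
  &\ge \mathbb{P}\left( \bigcap_{i=1}^{k} [E_i \le t_i] \right) \\
  &= \prod_{i=1}^{k} \mathbb{P}\left( E_i \le t_i \,\Big|\, \bigcap_{j<i} [E_j \le t_j] \right) \\
  &= \prod_{i=1}^{k} \mathbb{P}(E_i \le t_i \mid E_{i-1} \le t_{i-1}) \\
  &\ge \prod_{i=1}^{k} \mathbb{P}(E_i \le t_i \mid E_{i-1} = t_{i-1}) \\
  &\ge (1 - \epsilon)^k
\end{aligned}
\]

where the last line follows from the choice of $t_i$'s. Thus for any $\delta > 0$ we can choose $T_\delta$ to guarantee $\mathbb{P}(E_k \le T_\delta) \ge 
1 - \delta$ by setting $\epsilon = 1 - (1 - \delta)^{1/k}$ and applying the above process. We can make this guarantee for any $k$ which 
allows us to conclude that $\mathbb{P}(E_k \le t) \to 1$ as $t \to \infty$ for all $k$ as required. □ 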

Proposition 10. Fix a partitioning sequence. Let $A_t(X)$ denote the cell of $g_t$ (built based on the partitioning 
sequence) containing the point $X$. Then $\mathrm{diam}(A_t(X)) \to 0$ in probability as $t \to \infty$. 

Proof. Let $V_t(x)$ be the size of the first dimension of $A_t(x)$. It suffices to show that $\mathbb{E}[V_t(x)] \to 0$ for all $x$ in the 
support of $X$. 

Let $X_1, \ldots, X_{m'} \sim \mu|_{A_t(x)}$ for some $1 \le m' \le m$ denote the samples from the structure stream that are used 
to determine the candidate splits in the cell $A_t(x)$. Use $\pi_d$ to denote a projection onto the $d$th coordinate, and 
without loss of generality, assume that $V_t = 1$ and $\pi_1 X_i \sim \mathrm{Uniform}[0, 1]$. Conditioned on the event that the first 
dimension is cut, the largest possible size of the first dimension of a child cell is bounded by 

\[ V^* = \max\left( \max_{i=1}^{m'} \pi_1 X_i , \; 1 - \min_{i=1}^{m'} \pi_1 X_i \right) . \]

Recall that we choose the number of candidate dimensions as $\min(1 + \mathrm{Poisson}(\lambda), D)$ and select that number of 
distinct dimensions uniformly at random to be candidates. Define the following events: 

$E_1$ = {There is exactly one candidate dimension} 
$E_2$ = {The first dimension is a candidate} 

Then using $V'$ to denote the size of the first dimension of the child cell, 

\[
\begin{aligned}
\mathbb{E}[V'] &\le \mathbb{E}\left[ \mathbb{I}\{(E_1 \cap E_2)^c\} + \mathbb{I}\{E_1 \cap E_2\} V^* \right] \\
  &= \mathbb{P}(E_1^c) + \mathbb{P}(E_2^c \mid E_1) \mathbb{P}(E_1) + \mathbb{P}(E_2 \mid E_1) \mathbb{P}(E_1) \mathbb{E}[V^*] \\
  &= (1 - e^{-\lambda}) + \left(1 - \frac{1}{D}\right) e^{-\lambda} + \frac{1}{D} e^{-\lambda} \mathbb{E}[V^*] \\
  &= 1 - \frac{e^{-\lambda}}{D} + \frac{e^{-\lambda}}{D} \mathbb{E}[V^*] \\
  &= 1 - \frac{e^{-\lambda}}{D} + \frac{e^{-\lambda}}{D} \mathbb{E}\left[ \max\left( \max_{i=1}^{m'} \pi_1 X_i , \; 1 - \min_{i=1}^{m'} \pi_1 X_i \right) \right] \\
  &= 1 - \frac{e^{-\lambda}}{D} + \frac{e^{-\lambda}}{D} \cdot \frac{2m' + 1}{2m' + 2} \\
  &= 1 - \frac{e^{-\lambda}}{2D(m' + 1)} \\
  &\le 1 - \frac{e^{-\lambda}}{2D(m + 1)} .
\end{aligned}
\]

Iterating this argument we have that after $K$ splits the expected size of the first dimension of the cell containing 
$x$ is upper bounded by 

\[ \left( 1 - \frac{e^{-\lambda}}{2D(m + 1)} \right)^K \]

so it suffices to have $K \to \infty$ in probability, which we know to be the case from Proposition 9. □ 

Proposition 11. Fix a partitioning sequence. In any tree built based on this sequence, $N^e(A_t(X)) \to \infty$ in 
probability. 

Proof. It suffices to show that $N^e(A_t(x)) \to \infty$ for all $x$ in the support of $X$. Fix such an $x$. By Proposition 9 we 
can make the probability that $A_t(x)$ is split fewer than $K$ times arbitrarily small for any $K$. Moreover, by construction 
immediately after the $K$-th split is made the number of estimation points contributing to the prediction at $x$ is 
at least $\alpha(K)$, and this number can only increase. Thus for all $K$ we have that $\mathbb{P}(N^e(A_t(x)) < \alpha(K)) \to 0$ as 
$t \to \infty$ as required. □ 

We are now ready to prove our main result. All the work has been done; it is simply a matter of assembling the 
pieces. 

Proof (of Theorem 1). Fix a partitioning sequence. Conditioned on this sequence the consistency of each of the 
class posteriors follows from Theorem 5. The two required conditions were shown to hold in Propositions 10 
and 11. Consistency of the multiclass tree classifier then follows by applying Proposition 4. 

To remove the conditioning on the partitioning sequence, note that Lemma 6 shows that our tree generation 
mechanism produces a partitioning sequence with probability 1. Apply Proposition 3 to get unconditional 
consistency of the multiclass tree. 

Proposition 2 lifts consistency of the trees to consistency of the forest, establishing the desired result. □ 
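
As a numerical sanity check on the shrinkage argument in the proof of Proposition 10, the Python sketch below estimates $\mathbb{E}[V^*]$ by Monte Carlo and compares it with the closed form $(2m' + 1)/(2m' + 2)$. The closed form follows from $\mathbb{P}(V^* \le w) = (2w - 1)^{m'}$ for $w \in [1/2, 1]$ when the projected points are iid Uniform[0, 1]. The function names, the seed, and the parameter values are assumptions of the sketch.

```python
import math
import random


def expected_vstar_exact(m: int) -> float:
    """E[max(max_i U_i, 1 - min_i U_i)] for m iid Uniform[0, 1] points."""
    return (2 * m + 1) / (2 * m + 2)


def expected_vstar_monte_carlo(m: int, trials: int, rng: random.Random) -> float:
    """Monte Carlo estimate of the same expectation."""
    total = 0.0
    for _ in range(trials):
        points = [rng.random() for _ in range(m)]
        total += max(max(points), 1.0 - min(points))
    return total / trials


def contraction_bound(lam: float, num_dims: int, m: int, num_splits: int) -> float:
    """Upper bound (1 - exp(-lam) / (2 D (m + 1)))^K on the expected first-dimension size."""
    return (1.0 - math.exp(-lam) / (2.0 * num_dims * (m + 1))) ** num_splits


if __name__ == "__main__":
    rng = random.Random(0)
    for m in (1, 3, 10):
        print(m, expected_vstar_exact(m), expected_vstar_monte_carlo(m, 20000, rng))
    for k in (10, 100, 1000):
        print(k, contraction_bound(lam=1.0, num_dims=5, m=10, num_splits=k))
```

The per-split contraction factor is close to 1 when D or m is large, so the bound decays slowly, but it still tends to 0 as the number of splits K grows.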



